Previously:
v5.7
Linux v5.8 was released in August, 2020. Here s my summary of various security things that caught my attention:
arm64 Branch Target Identification
Dave Martin
added support for ARMv8.5 s Branch Target Instructions (
BTI), which are enabled in
userspace at execve() time, and all the time
in the kernel (which required manually marking up a lot of non-C code, like assembly and
JIT code).
With this in place, Jump-Oriented Programming (JOP, where code gadgets are chained together with jumps and calls) is no longer available to the attacker. An attacker s code must make direct function calls. This basically reduces the usable code available to an attacker from every word in the kernel text to only function entries (or jump targets). This is a low granularity forward-edge Control Flow Integrity (CFI) feature, which is important (since it greatly reduces the potential targets that can be used in an attack) and cheap (implemented in hardware). It s a good first step to strong CFI, but (as we ve seen with things like CFG) it isn t usually strong enough to stop a motivated attacker. High granularity CFI (which uses a more specific branch-target characteristic, like function prototypes, to track expected call sites) is not yet a hardware supported feature, but the software version will be coming in the future by way of
Clang s CFI implementation.
arm64 Shadow Call Stack
Sami Tolvanen landed the
kernel implementation of Clang s Shadow Call Stack (
SCS), which protects the kernel against Return-Oriented Programming (ROP) attacks (where code gadgets are chained together with returns). This backward-edge CFI protection is implemented by keeping a second dedicated stack pointer register (
x18
) and keeping a copy of the return addresses stored in a separate shadow stack . In this way, manipulating the regular stack s return addresses will have no effect. (And since a copy of the return address continues to live in the regular stack, no changes are needed for back trace dumps, etc.)
It s worth noting that unlike BTI (which is hardware based), this is a software defense that relies on the location of the Shadow Stack (i.e. the value of
x18
) staying secret, since the memory could be written to directly. Intel s hardware ROP defense (CET) uses a hardware shadow stack that isn t directly writable. ARM s hardware defense against ROP is
PAC (which is actually designed as an arbitrary CFI defense it can be used for forward-edge too), but that depends on having ARMv8.3 hardware. The expectation is that SCS will be used until PAC is available.
Kernel Concurrency Sanitizer infrastructure added
Marco Elver landed support for the
Kernel Concurrency Sanitizer, which is a new debugging infrastructure to find data races in the kernel, via
CONFIG_KCSAN
. This immediately found real bugs, with some fixes having
already landed too. For more details, see the
KCSAN documentation.
new capabilities
Alexey Budankov
added CAP_PERFMON
, which is designed to allow access to
perf()
. The idea is that this capability gives a process access to only read aspects of the running kernel and system. No longer will access be needed through the much more powerful abilities of
CAP_SYS_ADMIN
, which has many ways to change kernel internals. This allows for a split between controls over the confidentiality (read access via CAP_PERFMON) of the kernel vs control over integrity (write access via CAP_SYS_ADMIN).
Alexei Starovoitov
added CAP_BPF
, which is designed to separate BPF access from the all-powerful
CAP_SYS_ADMIN
. It is designed to be used in combination with
CAP_PERFMON
for tracing-like activities and
CAP_NET_ADMIN
for networking-related activities. For things that could change kernel integrity (i.e. write access),
CAP_SYS_ADMIN
is still required.
network random number generator improvements
Willy Tarreau made the
network code s random number generator less predictable. This will further frustrate any attacker s attempts to recover the state of the RNG externally, which might lead to the ability to hijack network sessions (by correctly guessing packet states).
fix various kernel address exposures to non-
CAP_SYSLOG
I fixed several situations where kernel addresses were still being exposed to unprivileged (i.e. non-
CAP_SYSLOG
) users, though usually only through odd corner cases. After
refactoring how capabilities were being checked for files in
/sys
and
/proc
, the
kernel modules sections,
kprobes, and
BPF exposures got fixed. (Though in doing so, I briefly made things much worse before getting it
properly fixed. Yikes!)
RISCV W^X detection
Following up on his recent work to
enable strict kernel memory protections on RISCV, Zong Li has now added
support for CONFIG_DEBUG_WX
as seen for other architectures. Any writable and executable memory regions in the kernel (which are lovely targets for attackers) will be loudly noted at boot so they can get corrected.
execve()
refactoring continues
Eric W. Biederman continued working on
execve()
refactoring, including getting rid of the
frequently problematic recursion used to locate binary handlers. I used the opportunity to dust off some old
binfmt_script
regression tests and get them into the kernel selftests.
multiple
/proc
instances
Alexey Gladkov modernized
/proc
internals and provided a way to have
multiple /proc
instances mounted in the same PID namespace. This allows for having multiple views of
/proc
, with different features enabled. (Including the newly added
hidepid=4 and
subset=pid mount
options.)
set_fs()
removal continues
Christoph Hellwig, with Eric W. Biederman, Arnd Bergmann, and others, have been diligently working to entirely remove the kernel s
set_fs()
interface, which has long been a source of security flaws due to weird confusions about which address space the kernel thought it should be accessing. Beyond things like the lower-level per-architecture
signal handling code, this has needed to touch various
parts of the
ELF loader, and
networking code too.
READ_IMPLIES_EXEC
is no more for native 64-bit
The
READ_IMPLIES_EXEC
flag was a work-around for dealing with the addition of non-executable (NX) memory when x86_64 was introduced. It was designed as a way to mark a memory region as well, since we don t know if this memory region was expected to be executable, we must assume that if we need to read it, we need to be allowed to execute it too . It was designed mostly for stack memory (where trampoline code might live), but it would carry over into all
mmap()
allocations, which would mean sometimes exposing a large attack surface to an attacker looking to find executable memory. While normally this didn t cause problems on modern systems that correctly marked their ELF sections as NX, there were still some awkward corner-cases. I fixed this by splitting
READ_IMPLIES_EXEC
from the ELF
PT_GNU_STACK
marking on
x86 and
arm/arm64, and declaring that a native 64-bit process would never gain
READ_IMPLIES_EXEC
on
x86_64 and
arm64, which matches the behavior of other native 64-bit architectures that correctly didn t ever implement
READ_IMPLIES_EXEC
in the first place.
array index bounds checking continues
As part of the ongoing work to use modern flexible arrays in the kernel, Gustavo A. R. Silva added the
flex_array_size()
helper (as a cousin to
struct_size()
). The
zero/one-member into flex array conversions continue with over a hundred commits as we slowly get closer to being able to build with
-Warray-bounds
.
scnprintf()
replacement continues
Chen Zhou joined Takashi Iwai in
continuing to replace potentially unsafe uses of
sprintf()
with
scnprintf()
. Fixing all of these will make sure the kernel avoids nasty buffer concatenation surprises.
That s it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.9.
2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.